Today:
library(tidyverse)
library(ggbeeswarm)
library(janitor)
library(emmeans)
library(glmmTMB)
library(car)
library(DHARMa)
gambusia <- read_csv("Data/Gambusia.csv")
## Rows: 126 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Focal_ID, Treatment, Tagged, Trial_Date, Rival_ID, Female_ID
## dbl (8): Row_ID, Trial, Focal_Size, Time_Female.sec, No.Cop_Attempts, No.Cop...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Yesterday, we fitted the model:
summary(m_poisson <- glmmTMB(No.Cop_Success ~ 1 + Treatment,
                             data=gambusia, family = poisson))
## Family: poisson ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 624.9 630.6 -310.5 620.9 124
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.52030 0.09713 5.357 8.47e-08 ***
## TreatmentWinner 0.16487 0.13204 1.249 0.212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using simulations, we saw that this model was not a very good representation of the data. The model is not aware of all the variation in the data:
fsimul <- function(x) {
  tibble(simul=simulate(m_poisson)) %>%
    tabyl(simul)
}
tibbsimul2 <- map_dfr(.x = 1:100, .f = fsimul)
ggplot(gambusia, aes(x=No.Cop_Success))+geom_bar() +
  geom_jitter(data = tibbsimul2,
              mapping = aes(x=simul, y=n, col=as.factor(simul)), alpha=0.4)
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
node [shape = square, color = DeepSkyBlue]
response; predictor
node [shape = diamond, color=ForestGreen]
intercept; effect
node [shape = circle, color = black]
'expected value';
'expected value' -> 'average count' [ label = ' exponential'];
'average count' -> response [ label = ' rpois(lambda=average count)', style=dashed];
intercept -> 'expected value';
predictor -> effect [arrowhead = none, label = ' \U00D7'];
effect -> 'expected value' ;
}")
With a simple Poisson family, a single parameter links the average count to the response. That can be a problem because the average count controls the mean, the variance, and every other aspect of the distribution of the response.
Running simulations like that to assess your models can be very useful, and sometimes it is the only way to understand what is wrong, whether it matters, and what to do about it. Fortunately, for routine checks we can assess our models more directly with the package DHARMa.
library(DHARMa)
The function testResiduals() first simulates data from our model, then measures the distance between the model predictions and the simulated data (these distances are “residuals”), and finally compares the simulated residuals to the residuals of the real data:
testResiduals(m_poisson)
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.3233, p-value = 7.277e-12
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 3.8684, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 7, observations = 126, p-value < 2.2e-16
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.01210317
## sample estimates:
## outlier frequency (expected: 0.00325396825396825 )
## 0.05555556
Here we can diagnose at least three problems:
- the residuals are not uniform (the Kolmogorov-Smirnov test rejects uniformity),
- the data are over-dispersed compared to the model simulations,
- there are more extreme observations (outliers) than expected.
This is exactly what we saw in our home-made simulations earlier. You can use either approach.
How do we build a better model? First we need to understand what exactly is wrong in the model structure.
Our GLM assumes that the data are generated by a Poisson process. Let’s simulate that process to understand it.
dat <- tibble(x=rpois(n = 10000, lambda = 2.8))
ggplot(dat, aes(x=x))+ geom_bar()
mean(dat$x)
## [1] 2.7872
var(dat$x)
## [1] 2.802196
The Poisson distribution takes a single parameter, lambda (\(\lambda\)). That parameter completely controls the shape of the distribution, and is equal to both the expected mean and the expected variance. That is a key issue: in a Poisson distribution, the expected mean dictates the variance. There is no room for any more variation.
In biology, this lambda is unlikely to be constant. A Poisson GLM “thinks” the predictors in the model capture all the variation in lambda, which corresponds to this set of equations:
\[ \mathrm{obs} \sim Poisson(\lambda) \\ \lambda = \mathrm{exp}(y)\\ y = \alpha + \beta x \]
That is often not sufficient. The data would probably be better represented by
\[ \mathrm{obs} \sim Poisson(\lambda) \\ \lambda = \mathrm{exp}(y)\\ y = \alpha + \beta x + \mathrm{noise} \] That is, there is unexplained variation left over after accounting for the model predictors. A Poisson GLM cannot see that noise, which produces over-dispersion, and it measures uncertainty using only the variation it can see. Therefore, a Poisson GLM often under-estimates uncertainty.
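A minimal sketch of that second process (all numbers arbitrary) shows where the over-dispersion comes from: as soon as lambda itself varies between observations, the variance of the counts exceeds their mean.

```r
set.seed(42)
n <- 10000
# lambda varies between observations because of unmodelled noise
lambda <- exp(1 + rnorm(n, mean = 0, sd = 0.5))
obs <- rpois(n, lambda = lambda)

mean(obs)  # the average count
var(obs)   # clearly larger than the mean: over-dispersion
```

With `sd = 0` the two numbers would match, as in the pure Poisson simulation above.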
If you forget about the problem of over-dispersion, you will find too many false-positive results. For instance, with a variance about 3 times larger than the mean, we find significant results about 30% of the time when there is no true effect (ideally we expect 5%).
## Simulate data without any effect of x
pvalues <- vector(length=100)
for (i in 1:100)
{
x <- rnorm(50)
y <- rnorm(50)
lambda <- exp(y)
obs <- sapply(lambda, rpois, n=1)
m0 <- glmmTMB(obs ~ x, family = "poisson")
pvalues[i] <- summary(m0)$coefficients$cond["x", "Pr(>|z|)"]
}
## Warning in glmmTMB(obs ~ x, family = "poisson"): use of the 'data' argument is
## recommended
## (warning repeated for each of the 100 fits)
mean(pvalues < 0.05)
## [1] 0.31
Basically, we need to give some freedom to the mean-variance relationship. There are many ways of doing that; some will work better or worse for particular datasets. There is no silver bullet: you need to try several options and assess how good each model is.
A first version, called “nbinom1” (“Negative-Binomial 1”), is an extension of the Poisson model in which the variance follows the equation \(V = \mu(1+\phi)\), where \(\mu\) is the mean and \(\phi\) is the over-dispersion parameter.
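As a quick check of that variance function (a sketch with arbitrary numbers): base R’s rnbinom() is parameterised so that the variance is mu + mu^2/size, so setting size = mu/phi reproduces the nbinom1 variance \(\mu(1+\phi)\).

```r
set.seed(1)
mu  <- 4   # arbitrary mean
phi <- 3   # arbitrary over-dispersion

# rnbinom(mu=, size=) has variance mu + mu^2/size;
# size = mu/phi turns that into mu * (1 + phi)
x <- rnbinom(n = 1e5, mu = mu, size = mu / phi)

mean(x)  # close to 4
var(x)   # close to 4 * (1 + 3) = 16
```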
summary(m_poisson_2 <- glmmTMB(No.Cop_Success ~ 1 + Treatment,
                               data=gambusia, family = nbinom1))
## Family: nbinom1 ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 454.2 462.8 -224.1 448.2 123
##
##
## Dispersion parameter for nbinom1 family (): 4.09
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5218 0.1974 2.644 0.00819 **
## TreatmentWinner 0.1621 0.2405 0.674 0.50038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s evaluate this model with DHARMa, and then by looking at simulated distributions:
testResiduals(m_poisson) # previous evaluation
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.3233, p-value = 7.277e-12
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 3.8684, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 7, observations = 126, p-value < 2.2e-16
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.01587302
## sample estimates:
## outlier frequency (expected: 0.00341269841269841 )
## 0.05555556
testResiduals(m_poisson_2) # much better
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.067347, p-value = 0.6171
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.76617, p-value = 0.496
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 0, observations = 126, p-value = 1
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.02380952
## sample estimates:
## outlier frequency (expected: 0.00611111111111111 )
## 0
fsimul <- function(x) {
  tibble(simul=simulate(m_poisson)) %>%
    tabyl(simul)
}
tibbsimul2 <- map_dfr(.x = 1:100, .f = fsimul)
ggplot(gambusia, aes(x=No.Cop_Success))+geom_bar() +
  geom_jitter(data = tibbsimul2,
              mapping = aes(x=simul, y=n, col=as.factor(simul)),
              alpha=0.4, inherit.aes = FALSE) + ggtitle("Poisson GLM")
fsimul <- function(x) {
  tibble(simul=simulate(m_poisson_2)) %>%
    tabyl(simul)
}
tibbsimul2 <- map_dfr(.x = 1:100, .f = fsimul)
ggplot(gambusia, aes(x=No.Cop_Success))+geom_bar() +
  geom_jitter(data = tibbsimul2,
              mapping = aes(x=simul, y=n, col=as.factor(simul)),
              alpha=0.4, inherit.aes = FALSE) + xlim(-1,12)+ ggtitle("Nbinom1 GLM")
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 201 rows containing missing values (geom_point).
In this case the “Negative Binomial 1” family seems appropriate, and much better than the strict Poisson family.
In general, avoid the Poisson family
(unless you model the over-dispersion within the Poisson family, using random effects or a secondary linear predictor for dispersion; we will not cover this topic in the course. Just remember that the Poisson family is dangerous, but not necessarily wrong if you know what you are doing!).
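For completeness, a hedged sketch of the random-effect option just mentioned (not covered in the course): an observation-level random effect adds one normal deviate per row, which absorbs the extra-Poisson variation. The column name obs_id is introduced here for illustration; it is not part of the original dataset.

```r
library(glmmTMB)

# One grouping level per observation ("observation-level random effect"):
# the per-row normal deviate plays the role of the "noise" term in the
# equations above. obs_id is a made-up helper column.
gambusia$obs_id <- factor(seq_len(nrow(gambusia)))

m_olre <- glmmTMB(No.Cop_Success ~ 1 + Treatment + (1 | obs_id),
                  data = gambusia, family = poisson)
summary(m_olre)
```

Such a model should also be checked with DHARMa before being trusted.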
In glmmTMB there are 5 basic GLM families for count data:
summary(m_poisson <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = poisson))
## Family: poisson ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 624.9 630.6 -310.5 620.9 124
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.52030 0.09713 5.357 8.47e-08 ***
## TreatmentWinner 0.16487 0.13204 1.249 0.212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_genpoi <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = genpois))
## Family: genpois ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 458.7 467.2 -226.4 452.7 123
##
##
## Dispersion parameter for genpois family (): 5.84
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5193 0.2050 2.533 0.0113 *
## TreatmentWinner 0.1668 0.2389 0.698 0.4852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_compois <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = compois)) ## Slow!
## Family: compois ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 469.0 477.5 -231.5 463.0 123
##
##
## Dispersion parameter for compois family (): 1.74e+09
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5203 0.1591 3.271 0.00107 **
## TreatmentWinner 0.1649 0.2218 0.743 0.45720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_nbinom1 <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = nbinom1))
## Family: nbinom1 ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 454.2 462.8 -224.1 448.2 123
##
##
## Dispersion parameter for nbinom1 family (): 4.09
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5218 0.1974 2.644 0.00819 **
## TreatmentWinner 0.1621 0.2405 0.674 0.50038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m_nbinom2 <- glmmTMB(No.Cop_Success ~ 1 + Treatment, data=gambusia, family = nbinom2))
## Family: nbinom2 ( log )
## Formula: No.Cop_Success ~ 1 + Treatment
## Data: gambusia
##
## AIC BIC logLik deviance df.resid
## 454.4 462.9 -224.2 448.4 123
##
##
## Dispersion parameter for nbinom2 family (): 0.447
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5203 0.2119 2.455 0.0141 *
## TreatmentWinner 0.1649 0.2973 0.555 0.5792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All of them return (almost exactly) the same parameter estimates: intercept = 0.52 and TreatmentWinner = 0.16. However, they return different standard errors, z values and p values (the Poisson family being very different from the other ones), because they make different assumptions about how variation in the data works, and therefore about how sure we should be about the patterns we see.
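To compare the five fits at a glance, the AIC values reported in the summaries above can be collected in one table (a sketch, assuming the five models fitted above are still in the workspace):

```r
# Lower AIC = better trade-off between fit and complexity.
# From the summaries above: poisson 624.9, genpois 458.7, compois 469.0,
# nbinom1 454.2, nbinom2 454.4 -- the Poisson stands out as much worse.
AIC(m_poisson, m_genpoi, m_compois, m_nbinom1, m_nbinom2)
```

AIC alone does not tell you whether any of the models is adequate; it only ranks them, so it complements rather than replaces the DHARMa checks.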
Each family may be more or less appropriate for different response variables. It really depends on how the variation relates to the expected mean of the response. Sometimes not a single one of these families will provide a good description of the variation. In those cases, adding random effects, or using a family with zero-inflation or a linear predictor for dispersion, can be necessary (not covered in this course).
Sometimes a count data GLM does not fit well and shows signs of under/over-dispersion because important predictors are missing in the model.
For instance, let’s simulate data where a count response “obs” is a function of size and sex on the log scale:
set.seed(351188)
fakedata <- tibble(size=rnorm(1000),
                   sex=sample(c(1,2), size=1000, replace = TRUE, prob = c(0.2, 0.8)),
                   y=1 + 0.5*size + 1.6*sex + rnorm(n = 1000, 0, 0.02),
                   obs=rpois(n = 1000, lambda = exp(y)))
We model obs as a function of size only:
summary(fake_nbinom1 <- glmmTMB(obs ~ 1 + size, data=fakedata, family = nbinom1))
## Family: nbinom1 ( log )
## Formula: obs ~ 1 + size
## Data: fakedata
##
## AIC BIC logLik deviance df.resid
## 9595.1 9609.8 -4794.5 9589.1 997
##
##
## Dispersion parameter for nbinom1 family (): 17.8
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.03847 0.01891 213.61 <2e-16 ***
## size 0.44591 0.01668 26.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1)
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.276, p-value < 2.2e-16
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.70967, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: simulationOutput
## outliers at both margin(s) = 10, observations = 1000, p-value = 0.472
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.004805511 0.018313243
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.01
Here we have a case of under-dispersion. The model does not fit the data well and over-estimates the amount of noise in the data. In the presence of under-dispersion we expect fewer false positives than normal, but also more false negatives; that is, our statistical model has less power to detect effects than it should.
Sometimes you can still get an okay model by fitting a different GLM, but adding the missing predictor in the model solves the issue much better:
summary(fake_nbinom1_sex <- glmmTMB(obs ~ 1 + size + sex, data=fakedata, family = nbinom1))
## Family: nbinom1 ( log )
## Formula: obs ~ 1 + size + sex
## Data: fakedata
##
## AIC BIC logLik deviance df.resid
## 6757.8 6777.4 -3374.9 6749.8 996
##
##
## Dispersion parameter for nbinom1 family (): 0.0503
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.000227 0.036553 27.36 <2e-16 ***
## size 0.497551 0.004066 122.37 <2e-16 ***
## sex 1.597368 0.018613 85.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1_sex)
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.012523, p-value = 0.9976
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0193, p-value = 0.68
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: simulationOutput
## outliers at both margin(s) = 8, observations = 1000, p-value = 0.8594
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.003459976 0.015702049
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.008
With the same data, if we fit sex only but not size, we get over-dispersion:
summary(fake_nbinom1_sex_only <- glmmTMB(obs ~ 1 + sex, data=fakedata, family = nbinom1))
## Family: nbinom1 ( log )
## Formula: obs ~ 1 + sex
## Data: fakedata
##
## AIC BIC logLik deviance df.resid
## 9429.3 9444.1 -4711.7 9423.3 997
##
##
## Dispersion parameter for nbinom1 family (): 15.4
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.67100 0.10798 15.47 <2e-16 ***
## sex 1.32142 0.05522 23.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1_sex_only)
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.052034, p-value = 0.008897
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.3403, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: simulationOutput
## outliers at both margin(s) = 11, observations = 1000, p-value = 0.2807
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.005503594 0.019596628
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.011
The issue disappears when we include size in the model:
summary(fake_nbinom1_sex <- glmmTMB(obs ~ 1 + size + sex, data=fakedata, family = nbinom1))
## Family: nbinom1 ( log )
## Formula: obs ~ 1 + size + sex
## Data: fakedata
##
## AIC BIC logLik deviance df.resid
## 6757.8 6777.4 -3374.9 6749.8 996
##
##
## Dispersion parameter for nbinom1 family (): 0.0503
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.000227 0.036553 27.36 <2e-16 ***
## size 0.497551 0.004066 122.37 <2e-16 ***
## sex 1.597368 0.018613 85.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(fake_nbinom1_sex)
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.012523, p-value = 0.9976
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0193, p-value = 0.68
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: simulationOutput
## outliers at both margin(s) = 8, observations = 1000, p-value = 0.8594
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.003459976 0.015702049
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.008
Time for practice!